Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Feiyang Huang, UID 306148942

Display machine information for reproducibility:

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.2 compiler_4.2.3    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.2.3       htmltools_0.5.5   rstudioapi_0.14   yaml_2.3.7       
 [9] rmarkdown_2.23    knitr_1.43        jsonlite_1.8.7    xfun_0.39        
[13] digest_0.6.32     rlang_1.1.1       evaluate_0.21    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)

Display memory information of your computer:

memuse::Sys.meminfo()
Totalram:    8.000 GiB 
Freeram:   154.203 MiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/Documents/203B/mimic/hosp/
total 35676328
-rw-rw-r--@ 1 huangfeifei  staff     15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff       427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff       859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff       578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff        12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff     25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff      7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff      1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-r--r--  1 huangfeifei  staff  13730083993 Jan 30 16:36 labevents.csv
-rw-rw-r--@ 1 huangfeifei  staff   1939088924 Jan  5  2023 labevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff     96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff     36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff      2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff     25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff      6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff       122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff      6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff     36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/Documents/203B/mimic/icu/
total 6155968
-rw-rw-r--@ 1 huangfeifei  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

Answer: fread is the fastest (about 0.7 s elapsed), read_csv is a close second (about 0.9 s), and read.csv is by far the slowest (about 4 s). Memory usage follows a similar pattern: read_csv (55.31 MB) and fread (50.13 MB) produce objects of comparable size, while the data.frame from read.csv is roughly three times larger (158.71 MB). The default parsed data types also differ. read.csv reads the ID columns as integer and leaves the date-time columns (admittime, dischtime, etc.) as plain character; character columns are not converted to factors unless stringsAsFactors = TRUE. read_csv infers column types from the data and returns a tibble, parsing the ID columns as double and the date-time columns as <dttm> (POSIXct). fread also infers types and returns a data.table, reading the ID columns as integer and keeping strings as character by default; whether it parses the date-time columns as POSIXct or leaves them as character depends on the data.table version and its tz argument.

system.time(read.csv("~/Documents/203B/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  3.943   0.088   4.048 
system.time(read_csv("~/Documents/203B/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  1.783   0.602   0.932 
system.time(fread("~/Documents/203B/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  0.598   0.109   0.732 
object_size(read.csv("~/Documents/203B/mimic/hosp/admissions.csv.gz"))
158.71 MB
object_size(read_csv("~/Documents/203B/mimic/hosp/admissions.csv.gz"))
55.31 MB
object_size(fread("~/Documents/203B/mimic/hosp/admissions.csv.gz"))
50.13 MB
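
To see the type differences concretely, here is a minimal sketch (re-reading the same file and tabulating the first class of each column; show_col_types = FALSE merely silences readr's column message):

path <- "~/Documents/203B/mimic/hosp/admissions.csv.gz"
df_base  <- read.csv(path)
tb_readr <- read_csv(path, show_col_types = FALSE)
dt_fread <- fread(path)
# first class of each column, side by side
data.frame(
  base  = sapply(df_base,  function(x) class(x)[1]),
  readr = sapply(tb_readr, function(x) class(x)[1]),
  dt    = sapply(dt_fread, function(x) class(x)[1])
)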

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

Answer: Supplying col_types barely changes the run time, since readr's type guessing on this file is already cheap, but the resulting tibble is somewhat smaller (51.86 MB vs. 55.31 MB), mainly because subject_id and hadm_id are now stored as 4-byte integers instead of 8-byte doubles.

read_csv("~/Documents/203B/mimic/hosp/admissions.csv.gz") |> 
  print(width = Inf)
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows
system.time(read_csv("~/Documents/203B/mimic/hosp/admissions.csv.gz", col_types =  cols(
  subject_id = col_integer(),
  hadm_id = col_integer(),
  admittime = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
  dischtime = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
  admission_type = col_character(),
  admit_provider_id = col_character(),
  admission_location = col_character(),
  discharge_location = col_character(),
  insurance = col_character(),
  language = col_character(),
  marital_status = col_character(),
  race = col_character(),
  edregtime = col_datetime(format = "%Y-%m-%d %H:%M:%S")
)))
   user  system elapsed 
  1.773   0.592   0.713 
object_size(read_csv("~/Documents/203B/mimic/hosp/admissions.csv.gz",  col_types = cols(
  subject_id = col_integer(),
  hadm_id = col_integer(),
  admittime = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
  dischtime = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
  admission_type = col_character(),
  admit_provider_id = col_character(),
  admission_location = col_character(),
  discharge_location = col_character(),
  insurance = col_character(),
  language = col_character(),
  marital_status = col_character(),
  race = col_character(),
  edregtime = col_datetime(format = "%Y-%m-%d %H:%M:%S")
)))
51.86 MB
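
Rather than typing the full specification by hand, readr can also print the specification it guessed, which makes a convenient starting point to edit (a sketch):

spec(read_csv("~/Documents/203B/mimic/hosp/admissions.csv.gz",
              show_col_types = FALSE))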

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/Documents/203B/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 huangfeifei  staff  1939088924 Jan  5  2023 /Users/huangfeifei/Documents/203B/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/Documents/203B/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

Answer: read_csv did not finish ingesting labevents.csv.gz within 5 minutes; with only 8 GiB of RAM, the attempt consumed essentially all available memory and the machine became unresponsive, so I aborted it. The chunk is left commented out below so that rendering does not hang.

#system.time(read_csv("~/Documents/203B/mimic/hosp/labevents.csv.gz"))
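
Since R.utils is already loaded, the attempt can also be capped automatically instead of killing the session by hand. A sketch using R.utils::withTimeout (note it can only interrupt R at interruptible points, so it is not guaranteed to stop a reader stuck in compiled code):

result <- tryCatch(
  R.utils::withTimeout(
    read_csv("~/Documents/203B/mimic/hosp/labevents.csv.gz"),
    timeout = 300, onTimeout = "error"  # give up after 5 minutes
  ),
  TimeoutException = function(e) {
    message("read_csv did not finish within 5 minutes; aborted.")
    NULL
  }
)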

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

Answer: No. Restricting col_select to the four columns does not solve the problem: read_csv still has to decompress and scan every line of the file, so the ingestion again ran past 5 minutes, exhausted memory, and had to be aborted.

#system.time(read_csv("~/Documents/203B/mimic/hosp/labevents.csv.gz", col_select = c("subject_id", "itemid", "charttime", "valuenum")))

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Answer:

zcat < ~/Documents/203B/mimic/hosp/labevents.csv.gz | awk -F, '($5 == 50912 || $5 == 50971 || $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 || $5 == 51301 || $5 == 50931) {print $2,$5,$7,$10}' | gzip > labevents_filtered.csv.gz
zcat < labevents_filtered.csv.gz | head -10
10000032 50882 2180-03-23 11:51:00 27
10000032 50902 2180-03-23 11:51:00 101
10000032 50912 2180-03-23 11:51:00 0.4
10000032 50971 2180-03-23 11:51:00 3.7
10000032 50983 2180-03-23 11:51:00 136
10000032 50931 2180-03-23 11:51:00 95
10000032 51221 2180-03-23 11:51:00 45.4
10000032 51301 2180-03-23 11:51:00 3
10000032 51221 2180-05-06 22:25:00 42.6
10000032 51301 2180-05-06 22:25:00 5
zcat < labevents_filtered.csv.gz | wc -l
 24855909
system.time(read_csv("labevents_filtered.csv.gz"))
   user  system elapsed 
 76.968 248.888 435.783 
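
One caveat: awk's default output field separator is a space and the header line is dropped, so labevents_filtered.csv.gz above is space-delimited and header-less rather than a true csv. A variant that keeps the comma delimiter and the header (a sketch, writing to a hypothetical labevents_filtered2.csv.gz) would let read_csv recover the four column names and types directly:

zcat < ~/Documents/203B/mimic/hosp/labevents.csv.gz | \
  awk -F, 'BEGIN {OFS = ","} NR == 1 || $5 == 50912 || $5 == 50971 || \
    $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 || \
    $5 == 51301 || $5 == 50931 {print $2, $5, $7, $10}' | \
  gzip > labevents_filtered2.csv.gz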

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what Apache Arrow is. Imagine you want to explain it to a layman in an elevator.

Answer: Apache Arrow is like a common language for data tables. It defines one standard way to lay a table out in memory, column by column, so that tools written in different languages (R, Python, and many others) can work on the very same data quickly, without copying or converting it back and forth. On top of that, the arrow package can scan files far bigger than a computer's memory, reading only the columns and rows a query actually needs. (Note that arrow::open_dataset is lazy: the timing below only measures setting up the query; the rows are actually read when the result is collected by as_tibble.)

zcat < ~/Documents/203B/mimic/hosp/labevents.csv.gz > labevents.csv
system.time({
  labevents <- arrow::open_dataset("labevents.csv", format = "csv")
  labevents_arrow_filtered <- labevents %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
   user  system elapsed 
  0.103   0.032   0.177 
labevents_arrow_filtered <- as_tibble(labevents_arrow_filtered)
nrow(labevents_arrow_filtered)
[1] 24855909
head(labevents_arrow_filtered, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10001217  51301 2157-12-23 19:59:00      8.8
 2   10001319  51221 2134-04-11 05:42:00     31.6
 3   10001319  51301 2134-04-11 05:42:00     10.3
 4   10001319  50912 2134-04-11 05:42:00      0.5
 5   10001319  51221 2134-04-15 00:54:00     32.9
 6   10001319  51301 2134-04-15 00:54:00     17.3
 7   10001319  51221 2135-07-19 23:22:00     33.2
 8   10001319  51301 2135-07-19 23:22:00     10.8
 9   10001319  51221 2138-09-27 15:22:00     34.8
10   10001319  51301 2138-09-27 15:22:00     10.7

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what the Parquet format is. Imagine you want to explain it to a layman in an elevator.

Answer: Parquet is a file format for tables that stores the data column by column instead of row by row, with compression and summary statistics built in. Since an analysis typically touches only a few columns, a reader can pull just those columns and skip irrelevant chunks of rows entirely, which makes Parquet files much smaller and much faster to query than the equivalent csv. It is a standard storage format in most big-data processing frameworks.

system.time({
  # write the filtered tibble from Q2.4 to Parquet, then re-open it as a dataset
  labevents_parquet <- arrow::write_dataset(labevents_arrow_filtered, "labevents.parquet")
  labevents_parquet <- arrow::open_dataset("labevents.parquet")
  labevents_parquet_filtered <- labevents_parquet %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
   user  system elapsed 
  2.078   0.398   2.767 
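
Note that the chunk above writes the already-filtered tibble from Q2.4 rather than the full csv, which is why it finishes in a few seconds. To literally re-write all of labevents.csv in Parquet, one could stream the Arrow csv dataset straight to disk without materializing it in memory (a sketch; labevents_full.parquet is a hypothetical output directory, and converting the full 13 GB file takes considerably longer):

arrow::open_dataset("labevents.csv", format = "csv") |>
  arrow::write_dataset("labevents_full.parquet", format = "parquet")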

The Parquet output labevents.parquet is a directory of .parquet files, about 96 MB in total. Note that file.size() applied to the directory itself returns only the size of the directory entry (96 bytes), not the data inside:

file.size("labevents.parquet")
[1] 96
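
A sketch that sums the member files to obtain the actual on-disk size:

sum(file.size(list.files("labevents.parquet", recursive = TRUE,
                         full.names = TRUE)))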
labevents_parquet_filtered <- as_tibble(labevents_parquet_filtered)
nrow(labevents_parquet_filtered)
[1] 24855909
head(labevents_parquet_filtered, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10001217  51301 2157-12-23 19:59:00      8.8
 2   10001319  51221 2134-04-11 05:42:00     31.6
 3   10001319  51301 2134-04-11 05:42:00     10.3
 4   10001319  50912 2134-04-11 05:42:00      0.5
 5   10001319  51221 2134-04-15 00:54:00     32.9
 6   10001319  51301 2134-04-15 00:54:00     17.3
 7   10001319  51221 2135-07-19 23:22:00     33.2
 8   10001319  51301 2135-07-19 23:22:00     10.8
 9   10001319  51221 2138-09-27 15:22:00     34.8
10   10001319  51301 2138-09-27 15:22:00     10.7

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what DuckDB is. Imagine you want to explain it to a layman in an elevator.

Answer: DuckDB is a database engine that lives inside your own program; there is no separate server to install or administer (it is often described as "SQLite for analytics"). It is built for analytical queries, meaning it can filter, join, and summarize tables with hundreds of millions of rows quickly, even on an ordinary laptop, and it can work directly on files such as Parquet.

system.time({
  labevents_parquet <- arrow::open_dataset("labevents.parquet", format = "parquet")
  labevents_duckdb <- arrow::to_duckdb(labevents_parquet)
  labevents_duckdb_filtered <- labevents_duckdb %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
  }
)
   user  system elapsed 
  0.266   0.055   0.410 
labevents_duckdb_filtered <- as_tibble(labevents_duckdb_filtered)
nrow(labevents_duckdb_filtered)
[1] 24855909
head(labevents_duckdb_filtered, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10027100  50971 2156-06-18 22:37:00      3.8
 2   10027100  50983 2156-06-18 22:37:00    136  
 3   10027100  51221 2156-06-25 21:50:00     35.7
 4   10027100  51301 2156-06-25 21:50:00      5.3
 5   10027100  50882 2156-06-25 21:50:00     26  
 6   10027100  50902 2156-06-25 21:50:00     98  
 7   10027100  50912 2156-06-25 21:50:00      0.9
 8   10027100  50931 2156-06-25 21:50:00    109  
 9   10027100  50971 2156-06-25 21:50:00      7.3
10   10027100  50983 2156-06-25 21:50:00    134  
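
The row count again matches the 24,855,909 lines from Q2.3, but the first 10 rows shown here and in Q2.4 and Q2.5 come out in different orders: neither Arrow nor DuckDB guarantees row order when scanning a dataset in parallel. Sorting before previewing makes the outputs directly comparable (a sketch):

labevents_duckdb_filtered |>
  arrange(subject_id, charttime, itemid) |>
  head(10)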

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/Documents/203B/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/Documents/203B/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Answer: Steps (using Apache Arrow, my favorite method from Q2):

1. Decompress chartevents.csv.gz to chartevents.csv in the current working directory.

zcat < ~/Documents/203B/mimic/icu/chartevents.csv.gz > chartevents.csv
  2. Ingest chartevents.csv with arrow::open_dataset and filter the rows to the five vital-sign itemids. (open_dataset is lazy, so the tiny timing below reflects only setting up the query.)
system.time({
  chartevents <- arrow::open_dataset("chartevents.csv", format = "csv")
  chartevents_arrow_filtered <- chartevents %>%
    filter(itemid %in% c(220045, 220181, 220179, 223761, 220210))
})
   user  system elapsed 
  0.012   0.007   0.023 
  3. Collect the result into a tibble with as_tibble (the step that actually reads the data) and display the number of rows and the first 10 rows.
chartevents_arrow_filtered <- as_tibble(chartevents_arrow_filtered)
chartevents_arrow_filtered |> 
  nrow() |> 
  print()
[1] 22502319
chartevents_arrow_filtered |> 
  head(10) |> 
  print()
# A tibble: 10 × 11
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10000032 29079034 39553978        47007 2180-07-23 14:01:00
 2   10000032 29079034 39553978        47007 2180-07-23 14:01:00
 3   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 4   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 5   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 6   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 7   10000032 29079034 39553978        66056 2180-07-23 12:00:00
 8   10000032 29079034 39553978        66056 2180-07-23 12:00:00
 9   10000032 29079034 39553978        66056 2180-07-23 12:00:00
10   10000032 29079034 39553978        66056 2180-07-23 12:00:00
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
#   valuenum <dbl>, valueuom <chr>, warning <int>
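
As a sanity check, the five itemids can be looked up in the d_items.csv.gz dictionary to confirm they are the intended vitals (a sketch):

read_csv("~/Documents/203B/mimic/icu/d_items.csv.gz", show_col_types = FALSE) |>
  filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) |>
  select(itemid, label, abbreviation)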